Abstract

Decision trees are a popular technique in statistical data classification. They recursively partition the feature space into disjoint 
sub-regions until each sub-region becomes homogeneous with respect to a particular class. The basic Classification and Regression 
Tree (CART) algorithm partitions the feature space using axis parallel splits. When the true decision boundaries are not aligned 
with the feature axes, this approach can produce a complicated boundary structure. Oblique decision trees use oblique decision 
boundaries to potentially simplify the boundary structure. The major limitation of this approach is that the tree induction algorithm 
is computationally expensive. In this article we present a new decision tree algorithm, called HHCART. The method utilizes a 
series of Householder matrices to reflect the training data at each node during the tree construction. Each reflection is based on the 
directions of the eigenvectors from each class's covariance matrix. Considering axis parallel splits in the reflected training data 
provides an efficient way of finding oblique splits in the unreflected training data. Experimental results show that the accuracy and 
size of the HHCART trees are comparable with some benchmark methods in the literature. The appealing feature of HHCART is 
that it can handle both qualitative and quantitative features in the same oblique split. 
1. Introduction

Decision trees (DTs) are an increasingly popular method used for classifying data. In the typical tree building 
procedure, the space that the data occupies (feature space) is iteratively partitioned into disjoint sub-regions until each 
sub-region is homogeneous (or near so) with respect to a particular class. In a DT, each sub-region is represented 
by a node in the tree. The node can be either terminal or non-terminal. Non-terminal nodes are impure and can be 
split further using a series of tests based on the feature variables, a process called splitting. Each split is determined 
by considering a series of hyperplanes which separate the feature space into two sub-regions. The best hyperplane 
split is chosen as the one which maximises the change in an impurity function ($\Delta I$). To obtain a fully grown tree, 
this process is recursively applied to each non-terminal node until terminal nodes are reached. The terminal nodes 
correspond to homogeneous or near homogeneous sub-regions in the feature space. Each terminal node is assigned 
the class label that minimises the misclassification cost at the node. 

DTs play an important role in statistical learning and have been a popular technique for data classification over several 
decades (see |l3][l2]|T5l). In the tree building process the aim is to produce accurate and smaller trees while minimising 
the computational time. Accuracy, size and time mainly depend on the way non-terminal nodes are split in a DT. Three 
types of splits are considered: axis parallel, oblique and non-linear. Axis parallel splits partition the 
space parallel to feature axes. Therefore axis parallel trees are desirable when the decision boundaries are aligned 
with the feature axes. Oblique splits are hyperplane splits defined by a linear combination of the feature variables. 
These splits are more appealing when the decision boundaries are not aligned with the feature axes. Non-linear splits 
are a general class of splits. Decision boundaries generated by these splits can take arbitrary shapes and can 
easily be influenced by noisy data. 

Many algorithms have been proposed to induce DTs. In general, these algorithms differ in the way they search for 
the best split at each non-terminal node. Many studies show that trees which use oblique splits are generally 
smaller and more accurate than axis parallel trees. They have therefore become increasingly 
popular in the DT literature, which motivated us to propose a new methodology to construct a DT which uses oblique splits 
at each non-terminal node. These DTs are called Oblique Decision Trees. More specifically, let the feature vector 
consist of $p$ attributes, $x = [x_1, x_2, \ldots, x_p]^T$ where $x_k \in \mathbb{R}$. The oblique splits can be defined as linear combinations of 
features of the form 

$$\sum_{k=1}^{p} a_k x_k + a_{p+1} < 0, \quad \text{where } a_1, a_2, \ldots, a_{p+1} \in \mathbb{R}. \qquad (1)$$
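As a small illustration of the split form (1), the sketch below evaluates an oblique split on a handful of points. The function name `oblique_split_mask` and the example data are hypothetical and not part of the original method description; an axis parallel split is the special case in which only one of the first $p$ coefficients is non-zero.

```python
import numpy as np

def oblique_split_mask(X, a):
    """Return True for each row x of X where sum_k a_k x_k + a_{p+1} < 0."""
    X = np.asarray(X, dtype=float)
    a = np.asarray(a, dtype=float)
    if a.shape[0] != X.shape[1] + 1:
        raise ValueError("a must hold p + 1 coefficients")
    return X @ a[:-1] + a[-1] < 0

# Hypothetical example: the oblique boundary x1 + x2 - 1 = 0 in two dimensions.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.5], [0.9, 0.3]])
a = np.array([1.0, 1.0, -1.0])
mask = oblique_split_mask(X, a)

# Axis parallel special case: x1 < 0.5, i.e. a = (1, 0, -0.5).
axis_mask = oblique_split_mask(X, np.array([1.0, 0.0, -0.5]))
```

Examples with a `True` entry fall on one side of the hyperplane and would be sent to one child node; the remaining examples go to the other child.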

One of the major issues when inducing an oblique DT is the time complexity of the induction algorithm. In a data 
structure with $p$ feature variables and $n$ examples at a non-terminal node, the number of splits to be evaluated to find 
the best axis parallel split is $O(np)$. Therefore, the globally optimal split (with respect to an impurity function) at a 
non-terminal node can be found by exhaustively searching all possible splits along the feature axes. However, the 
number of splits to be evaluated to find the best oblique split at a node by exhaustive search is at most $2^p \binom{n}{p}$. 
Hence, an exhaustive search for the best oblique split is impractical. Furthermore, the best split at a node does 
not necessarily lead to the optimal tree. Spending more time searching for the best split at a node may not, in general, 
be beneficial. Furthermore, [6] point out that the problem of finding an optimal binary DT is NP-complete. 
This led us to search for efficient heuristics for constructing near optimal decision trees. In this work, we propose a 
simple and effective heuristic method to induce oblique decision trees. 

The remaining sections of this paper are organized as follows: Section 2 highlights related work. Section 3 introduces 
our proposed method. Comparisons with some commonly used DT algorithms are presented in Section 4. Section 5 
concludes the paper with discussions. 

2. Related Work 

Most of the oblique DT induction algorithms construct DTs in a top-down fashion IfTSl . The induction algorithms 
differ in the way they search for the best split and can be categorised as follows. We define three categories: induction 
algorithms that use optimisation techniques, those that use standard statistical techniques, and those that use heuristic arguments. 


2.1. Tree induction methods based on optimisation techniques 

The first major oblique DT algorithm was Classification and Regression Trees - Linear Combination, which is 
commonly known as CART-LC |I3]. CART-LC uses a deterministic hill climbing algorithm to search for the best 
oblique split at a non-terminal node. A backward feature elimination process is also carried out to delete irrelevant 
features from the split. CART-LC will not necessarily find the best split at each node because there is no built-in 
mechanism to avoid getting stuck in the local maxima of $\Delta I$. The best split found may be only a local, rather than 
global, maximiser of $\Delta I$. 

The Simulated Annealing Decision Tree (SADT) uses the simulated annealing optimisation 
algorithm, which relies on randomisation, to search for the best split. The use of randomisation potentially 
avoids getting stuck in local maxima of $\Delta I$ and will often produce better trees than those of CART-LC. The main 
disadvantage of the algorithm is the time taken to find the best split. In some cases it may require the evaluation of 
tens of thousands of hyperplanes before finding an optimal split. 

The concepts of CART-LC and SADT are combined in an oblique DT methodology called OC1. 
The method uses a deterministic hill climbing algorithm to perturb the coefficients of an initial hyperplane until a 
local maximum of $\Delta I$ is found. Then the hyperplane is perturbed randomly in an attempt to find a hyperplane that 
improves $\Delta I$ further. These two steps are repeated several times. Each time the algorithm starts with a different 
initial hyperplane, with one being the best axis parallel split and the others chosen randomly. After many hyperplanes 
have been evaluated, the one that maximises $\Delta I$ is taken as the splitting hyperplane. The time complexity at each 
non-terminal node for OC1 in the worst case is shown to be $O(pn^2 \log(n))$ provided that the Max Minority or 
Sum Minority impurity measures are used. However, the complexity increases for other impurity measures and for 
multi-class problems. One feature of both SADT and OC1 is that both algorithms can construct different decision 
trees on different runs using the same learning sample. Therefore, it is possible to run these algorithms multiple times 
and pick the best tree. However, this advantage is only realised on relatively small training example sets. 

2.2. Tree induction methods based on standard statistical techniques 

Various oblique DT induction algorithms have been developed using standard statistical techniques and can be 
found in mm M and lfT2ll . The advantage of this approach is that the time required to induce DTs is generally 
lower than those based on optimisation algorithms. The Quick Unbiased Efficient Statistical Tree (QUEST) uses Linear 
Discriminant Analysis (LDA) to find the best split at each node and hence there is no requirement for searching for 
the best split. QUEST's axis parallel tree begins by performing an ANOVA test at each non-terminal node to select the 
best feature. LDA is then applied on the selected feature to find the best splitting point. QUEST's oblique DT simply 
applies LDA on all features to find the best splitting hyperplane. Furthermore, QUEST is able to find oblique splits 
which are a linear combination of qualitative and quantitative features. For multi-class problems, QUEST groups 
the classes into two super-classes using the 2-means clustering algorithm, which increases the time complexity of the 
algorithm. 


2.3. Tree Induction Methods based on Heuristics 

DTs based on heuristic arguments have gained popularity in the recent past. In this approach, the splitting logic is 
constructed by assuming a structure for the class boundaries. If the assumption is true, DTs based on heuristic arguments 
produce accurate and smaller trees. DTs based on heuristic arguments can be found in |[T] and IfTTl . 

The CARTopt algorithm uses a two-class oblique tree to find a minimiser of a nonsmooth 
function $f(x)$ where $x \in \mathbb{R}^n$. Initially the examples in $\mathbb{R}^n$ are labelled as high and low depending on their value of 
$f(x)$. An oblique DT is then used to form a partition on $\mathbb{R}^n$ which separates the low points from the high points. Rather 
than forming the oblique DT directly, the authors reflect the training examples using a Householder matrix. Axis 
parallel splits are then searched in the reflected training data. These splits are oblique in the original space. 

CARTopt introduces a new heuristic to induce oblique decision trees. It uses the simplest form of splits, axis parallel 
splits, to find oblique splits. Hence the time complexity of searching for oblique splits using CARTopt's approach is lower than 
that of methods based on optimisation algorithms. In this study we extend CARTopt's idea in a number of ways to develop 
a complete oblique DT for statistical data classification. 



3. Methodology 

We extend the oblique DT method used in the CARTopt optimisation algorithm in a number of ways to develop 
a complete oblique DT called HHCART. First, CARTopt is designed to classify two classes whereas HHCART 
can handle multi-class classification problems. Second, CARTopt reflects the training examples at the root node only 
whereas HHCART performs reflections at each non-terminal node during tree construction. Finally, CARTopt is only 
defined for quantitative features whereas HHCART is capable of finding oblique splits which can be linear combinations 
of both quantitative and qualitative features. 

First, we explain the basic concept of our algorithm for a two-class classification problem. The algorithm easily generalises 
to the multi-class problem. In our approach we find each separating hyperplane by considering the orientation 
of each class. We propose the dominant eigenvector of the covariance matrix of a class to represent the orientation 
of that class. If this orientation is parallel to one of the feature axes, the best separating hyperplane may be found 
by performing axis parallel splits. Otherwise, we reflect the set of examples to a new coordinate system such that 
the orientation of one of the classes becomes parallel to one of the axes in the reflected feature space. Axis parallel 
splits can then be searched in the reflected feature space to find the best split. This split will be oblique in the original 
feature space. 

Consider the two-dimensional, two-class classification problem shown in Figure 1(a). 

First we define the estimated covariance matrix of a set of examples. Let $x_1, x_2, \ldots, x_n$ be $p$-dimensional feature 
vectors where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$. Then the estimated covariance matrix is given by 

$$S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T, \quad \text{where } \bar{x} = (\bar{x}_1, \ldots, \bar{x}_p)^T \text{ is the mean vector.}$$

Figure 1. Mechanism of the Householder reflection. (a) Scatter in the original space; $d^1$ is the dominant eigenvector of the 
covariance matrix of class one. (b) Scatter in the reflected space and the best axis parallel split found. (c) Oblique split in the 
original space. 

We reflect the examples using a Householder matrix which can be defined as follows. Let $d^1$ be the dominant eigenvector 
of the estimated covariance matrix of class 1 examples. Then there exists an orthogonal symmetric matrix $H_{p \times p}$ 
(where $p$ is the number of features) such that 

$$H = I - 2uu^T, \quad \text{where } u = \frac{e_1 - d^1}{\|e_1 - d^1\|_2} \qquad (2)$$

and $e_1 = (1, 0, \ldots, 0)^T$ is a $p \times 1$ vector. 

Let $D_{n \times p}$ be the training example set. The reflected example set $\hat{D}_{n \times p}$ is obtained using $\hat{D} = DH$. Since $H_{p \times p}$ is 
symmetric and orthogonal, a point in the transformed space can be mapped back to the original space at a minimal cost 
($HH = I$). The mechanism of the Householder reflection is that it reflects the vector $d^1$ onto $e_1$ by a reflection through 
the plane perpendicular to the vector $e_1 - d^1$. The reflected example set is shown in Figure 1(b). 
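The reflection described above can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic two-dimensional class; the helper name `householder_matrix` and the generated data are illustrative assumptions rather than part of the original description.

```python
import numpy as np

def householder_matrix(d, axis=0):
    """H = I - 2 u u^T with u = (e - d) / ||e - d||_2, for a direction d."""
    d = np.asarray(d, dtype=float)
    d = d / np.linalg.norm(d)
    e = np.zeros_like(d)
    e[axis] = 1.0
    diff = e - d
    norm = np.linalg.norm(diff)
    if norm < 1e-12:              # d already lies along the axis: no reflection needed
        return np.eye(d.size)
    u = diff / norm
    return np.eye(d.size) - 2.0 * np.outer(u, u)

rng = np.random.default_rng(0)
# Hypothetical class-1 sample, elongated along the direction (1, 1).
Z = rng.normal(size=(200, 2)) * np.array([3.0, 0.3])
R = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
D1 = Z @ R.T

S = np.cov(D1, rowvar=False)              # estimated covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
d1 = eigvecs[:, -1]                       # dominant eigenvector
H = householder_matrix(d1)
D1_reflected = D1 @ H                     # reflected example set
```

In the reflected set the dominant direction of the class lies along the first coordinate axis, so a threshold on a single reflected coordinate corresponds to an oblique boundary in the original space.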

Each column of H represents the direction of a coordinate axis in the reflected space. Axis parallel splits are searched 
along these axes. These splits are oblique in the original space. The best axis parallel split found in the reflected space, 
which is oblique in the original space, is shown in Figure 1(c). 
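To make the link between the two spaces concrete, the sketch below converts an axis parallel split on the $j$-th reflected coordinate into coefficients of the oblique form (1). Because a reflected example is $x^T H$, its $j$-th coordinate is the inner product of $x$ with the $j$-th column of $H$. The function name `oblique_coefficients`, the direction `d` and the threshold `c` are hypothetical choices made for illustration.

```python
import numpy as np

def oblique_coefficients(H, j, c):
    """Coefficients (a_1, ..., a_p, a_{p+1}) of the reflected split x_hat_j < c."""
    return np.append(H[:, j], -c)

# Hypothetical unit direction and the corresponding Householder matrix.
d = np.array([3.0, 4.0]) / 5.0
e1 = np.array([1.0, 0.0])
u = (e1 - d) / np.linalg.norm(e1 - d)
H = np.eye(2) - 2.0 * np.outer(u, u)

X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.0]])
X_hat = X @ H                      # reflected examples
j, c = 0, 1.5                      # axis parallel split x_hat_0 < 1.5
a = oblique_coefficients(H, j, c)

left_reflected = X_hat[:, j] < c               # split evaluated in the reflected space
left_original = X @ a[:-1] + a[-1] < 0         # same split evaluated in the original space
```

Both masks select the same examples, so only the $p + 1$ coefficients need to be stored at the node; new observations do not have to be reflected explicitly.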

The axis parallel search space can be enhanced by using all possible eigenvectors for reflections. For a $p$-dimensional 
classification problem with $C$ classes there are $Cp$ eigenvectors to be considered for the Householder reflection. 
This increases the time complexity of tree induction, but offers an opportunity to produce better trees. 

Here we explain the complete algorithm of HHCART. We propose two versions of HHCART: HHCART(A) is based 
on all possible eigenvectors of all classes and HHCART(D) is based on only the dominant eigenvector of each class. 
For any given non-terminal node $t$, let $D_t$ and $C_t$ be the set of examples and classes available at that node respectively. 
At node $t$, HHCART(A) finds all eigenvectors of the estimated covariance matrix for each class whereas HHCART(D) 
finds only the dominant eigenvector of each class. A Householder matrix is constructed for each eigenvector. Then 
$D_t$ is reflected using each Householder matrix, and axis parallel splits are performed along each coordinate axis in the 
reflected space. The best axis parallel split is chosen as the separating hyperplane at node $t$. However, if the eigenvector 
is already parallel to any of the feature axes, no reflection is done and hence axis parallel splits are searched in the 
original space. The hyperplane found divides node $t$ into two child nodes. The algorithm is recursively run on all 
child nodes until each child node satisfies either: 

a. the misclassification rate at the child node is either 0 or not greater than a user specified threshold (MisRate); or 

b. the number of examples in the node is less than or equal to a user specified threshold (MinParent). 
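The two stopping conditions can be expressed as a small predicate. This is only a sketch: the function name `is_terminal` is hypothetical, and the misclassification rate is assumed here to be the fraction of examples outside the majority class (unit misclassification costs).

```python
from collections import Counter

def is_terminal(labels, min_parent=2, mis_rate=0.0):
    """True if a node holding these class labels should not be split further."""
    n = len(labels)
    if n == 0:
        return True
    majority_count = Counter(labels).most_common(1)[0][1]
    misclassification = 1.0 - majority_count / n     # assumed unit costs
    # Condition (a): node is pure enough; condition (b): node is too small.
    return misclassification <= mis_rate or n <= min_parent

pure = is_terminal([1, 1, 1, 1])
impure = is_terminal([0, 1, 1, 1])
small = is_terminal([0, 1])
tolerated = is_terminal([0, 1, 1, 1], mis_rate=0.25)
```

With the default values shown (MinParent = 2, MisRate = 0) a node is split until it is pure or holds at most two examples.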
Input: examples at node $t$, called $D_t$; MinParent; MisRate; and $\tau > 0$.
Initialisation:
    Define $N_t$ = number of examples in $D_t$;
    Define $mp_t$ = misclassification rate at node $t$;
    Define $C_t$ = number of classes at node $t$;
    Define $p$ = number of features;
    $\Delta I_{max} = 0$;
    $h_t$ = empty;

if ($N_t$ > MinParent) and (MisRate < $mp_t$) then
    for $i = 1 : C_t$ do
        Extract the examples that belong to the $i$-th class in $D_t$, called $D_t^i$;
        Compute the normalised eigenvectors and eigenvalues of the estimated covariance matrix for $D_t^i$:
            $(d_1^i, \lambda_1^i), \ldots, (d_p^i, \lambda_p^i)$;
        for $j = 1 : p$ do
            if $\lambda_j^i \neq 0$ then
                if $\|e_1 - d_j^i\| < \tau$ or $\|e_2 - d_j^i\| < \tau$ or ... or $\|e_p - d_j^i\| < \tau$ then
                    $H_j^i = I$, the identity matrix;
                else
                    Construct the Householder matrix $H_j^i$ using $d_j^i$;
                end
                Reflect $D_t$: $\hat{D}_t = D_t H_j^i$;
                Find the best axis parallel hyperplane split, called $h_j^i$;
                if impurity reduction of $h_j^i$ > $\Delta I_{max}$ then
                    Replace $h_t$ with $h_j^i$, the best hyperplane found so far;
                    Replace $\Delta I_{max}$ with the impurity reduction of $h_j^i$;
                end
            end
        end
    end
end

Algorithm 1: Overview of the HHCART(A) algorithm at a single node 
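The single-node search of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: it assumes NumPy, uses the Gini index instead of the Twoing rule to keep the code short, assumes at least two quantitative features, omits the MinParent and MisRate checks, and skips classes whose examples all share one feature vector. All function names and the toy data are hypothetical.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector (0.0 for an empty or pure node)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_axis_parallel_split(X, y):
    """Exhaustive search over midpoints; returns (gain, feature index, threshold)."""
    n, p = X.shape
    parent = gini(y)
    best = (0.0, None, None)
    for j in range(p):
        values = np.unique(X[:, j])
        for c in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] < c
            n_left = left.sum()
            children = (n_left * gini(y[left]) + (n - n_left) * gini(y[~left])) / n
            if parent - children > best[0]:
                best = (parent - children, j, c)
    return best

def householder(d, tau):
    """Identity if d is within tau of a feature axis, else the reflection of d onto e_1."""
    I = np.eye(d.size)
    if np.min(np.linalg.norm(I - d, axis=1)) < tau:   # rows of I - d are e_k - d
        return I
    u = I[0] - d
    u = u / np.linalg.norm(u)
    return I - 2.0 * np.outer(u, u)

def hhcart_a_split(X, y, tau=0.05, eps=1e-10):
    """Return (impurity reduction, coefficients a of the oblique split a[:p].x + a[p] < 0)."""
    best_gain, best_a = 0.0, None
    for cls in np.unique(y):
        Xc = X[y == cls]
        if len(np.unique(Xc, axis=0)) < 2:            # no usable covariance matrix
            continue
        eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        for lam, d in zip(eigvals, eigvecs.T):
            if lam <= eps:                            # skip (numerically) zero eigenvalues
                continue
            H = householder(d, tau)
            gain, j, c = best_axis_parallel_split(X @ H, y)
            if j is not None and gain > best_gain:
                best_gain, best_a = gain, np.append(H[:, j], -c)
    return best_gain, best_a

# Hypothetical toy data: two parallel bands separated by the oblique line x2 = x1.
t = np.linspace(-3.0, 3.0, 21)
d_true = np.array([1.0, 1.0]) / np.sqrt(2.0)
normal = np.array([-1.0, 1.0]) / np.sqrt(2.0)
X = np.vstack([np.outer(t, d_true) + 0.5 * normal, np.outer(t, d_true) - 0.5 * normal])
y = np.array([0] * 21 + [1] * 21)
gain, a = hhcart_a_split(X, y)
left = X @ a[:-1] + a[-1] < 0
```

On this toy problem no single axis parallel split in the original space separates the classes, while one reflected split does. Restricting the inner loop to the eigenvector with the largest eigenvalue would give the HHCART(D) variant.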


An overview of the HHCART(A) algorithm at node $t$ is given in Algorithm 1. The time complexity at a node for 
HHCART(A) in the worst case is $O(Cn^2p^3)$ (see Appendix A for the derivation). However, if HHCART(D) is used the 
time complexity reduces to $O(Cn^2p^2)$. 

3.1. Small Samples 

As the tree grows, the number of examples at each node usually becomes small. This raises two questions to be 
answered, (a) Is it worthwhile searching for an oblique split or is an axis parallel split sufficient? (b) How are the 
eigenvectors calculated for small sample sizes? The first problem is common for any oblique DT. In the OC1 algorithm 
the authors suggest using oblique splits if the number of examples at a node is greater than twice the number of feature 
variables. The second question has two parts: 

1. Non-availability of some eigenvalues (the ones equal to zero) due to a singular covariance matrix. 

2. Performing an eigen analysis for classes having only one example or several examples with the same feature 
vector. 

Part (1) can be solved without modifying the HHCART algorithm because the reflection is done using the eigenvectors 
whose eigenvalues are not zero. For part (2), we simply omit classes that have a single example or several 
examples with the same feature vector. However, if all the classes suffer from this problem, then axis parallel splits 
are performed. 

3.2. Qualitative Variables 

Data classification problems often contain a mixture of quantitative and qualitative feature variables. Since the 
class discriminatory information may be contained in both types of feature variables, an effective classifier should be 
able to handle both types of features in the classification process. For a qualitative feature variable $X$, the form of the 
split is given by $X \in A$ where $A$ is a non-empty subset of values taken by $X$. If a qualitative feature has $M$ non-empty 
levels, $2^{M-1} - 1$ splits are possible. Axis parallel algorithms which consider qualitative splits are available in the literature. 
Incorporating qualitative features in oblique splits has not been explored much. The QUEST algorithm is capable 
of finding oblique splits with both qualitative and quantitative features. QUEST transforms each unordered qualitative 
feature variable into a new ordered quantitative feature variable. Each level of an unordered qualitative feature is 
mapped to an ordered value called a CRIMCOORD. The exact CRIMCOORD algorithm is given in the original QUEST paper. We 
implement the same CRIMCOORD algorithm in HHCART to induce oblique splits which contain both qualitative and 
quantitative features. At each node, a new quantitative feature is constructed for each qualitative feature by mapping 
its levels to CRIMCOORDs. Then these new quantitative features are amalgamated with the existing quantitative 
features in the example set. The HHCART algorithm can then be applied to find the best oblique split. At each node 
the CRIMCOORD corresponding to each level of each qualitative feature is stored. When predicting, the level of 
each qualitative feature of an unclassified observation is replaced by the corresponding CRIMCOORD attached to 
each node along its path. 
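The number of distinct binary splits on a qualitative feature grows quickly with the number of levels, which is one motivation for mapping levels to ordered values instead. As a hedged illustration, the sketch below enumerates the $2^{M-1} - 1$ subsets $A$ for a feature with $M$ levels; the function name `qualitative_splits` and the level names are hypothetical.

```python
from itertools import combinations

def qualitative_splits(levels):
    """Yield each subset A (as a frozenset) defining a distinct split X in A vs X not in A."""
    levels = sorted(levels)
    anchor, rest = levels[0], levels[1:]
    # Keeping the first level inside A avoids counting A and its complement twice.
    # The full set of levels is excluded because it would leave one child empty.
    for size in range(len(rest)):
        for combo in combinations(rest, size):
            yield frozenset((anchor,) + combo)

splits = list(qualitative_splits(["a", "b", "c", "d"]))
n_splits = len(splits)
```

For the four levels above this gives $2^{3} - 1 = 7$ candidate splits; a feature with 20 levels would already give more than half a million.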

4. Experiments 

Two sets of experiments are carried out to compare the performance of HHCART with other DT methods. The first 
experiment considers quantitative example sets and the second experiment considers example sets with both qualitative 
and quantitative features. Both HHCART(A) and HHCART(D) methods are considered in the experiments. 
4.1. Comparison on datasets having quantitative features only 

In this section, we compare the HHCART methods with OC1, OC1-LC (the OC1 version of Breiman's linear combination 
method) and OC1-AP (the OC1 version of axis parallel splits). All of these methods are available in the OC1 
system, which is freely available. However, the backward feature elimination process of Breiman's CART-LC 
method is not included in OC1-LC and hence it is somewhat different from the original method. 

4.1.1. Experimental Setup 

Experiments were performed on real data sets that were downloaded from the UCI Repository and are given in Table 1. In our 
algorithm we set MinParent = 2, MisRate = 0 and $\tau = 0.05$. For OC1, OC1-LC and OC1-AP, MinParent was set to 2. 
All the algorithms used the Twoing rule as the measure of impurity and cost complexity pruning with zero 
standard error. For OC1, the number of restarts and number of jumps were set to 20 and 5 (default values) respectively. 
Five-fold cross validation was used to estimate the classification accuracy. For each fold, 10% of the training set 
was used exclusively for pruning. We then used ten repetitions of five-fold cross validation to estimate the accuracy and the size 
of the tree. Therefore, to estimate accuracy and tree size (number of terminal nodes) the average over ten runs was 
used. Results are reported in Table 2 along with the respective standard deviations. The Shuttle data set comes with its 
own training set containing 43500 examples and a test set with 14500 examples. Therefore instead of performing a 
cross validation experiment, we induced 10 trees, each using 90% of the training examples for induction and the remaining 
10% for pruning. The accuracy of all the trees was estimated using the Shuttle test set. Since approximately 80% 
of the examples belong to class 1, the aim is to achieve an accuracy between 99% and 99.9% [2]. 
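The setup names the Twoing rule without stating it. For reference, the sketch below computes the standard twoing criterion from the CART literature, $\frac{P_L P_R}{4} \left( \sum_j |p(j \mid t_L) - p(j \mid t_R)| \right)^2$, for a candidate split. That this exact form matches the implementations used in the experiments is an assumption, and the function name `twoing` is hypothetical.

```python
import numpy as np

def twoing(y_left, y_right):
    """Twoing value of a split given the class labels sent to each child."""
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    n_l, n_r = y_left.size, y_right.size
    if n_l == 0 or n_r == 0:
        return 0.0                       # a split with an empty child is useless
    p_l = n_l / (n_l + n_r)              # proportion of examples sent left
    p_r = 1.0 - p_l
    classes = np.union1d(y_left, y_right)
    # Sum over classes of the absolute difference in class proportions.
    diff = sum(abs(np.mean(y_left == c) - np.mean(y_right == c)) for c in classes)
    return p_l * p_r / 4.0 * diff ** 2

perfect = twoing([0, 0], [1, 1])
useless = twoing([0, 1], [0, 1])
```

Larger values indicate better splits: a balanced split that separates two classes perfectly scores 0.25, while a split that leaves the class proportions unchanged scores 0.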


Table 1. Real data sets with quantitative features, downloaded from the UCI Repository 

Data set               No. of features   No. of classes   No. of examples
Heart (HRT)            13                2                270
Pima Indian (PIND)     8                 2                768
Breast Cancer (BC)     9                 2                638
Boston Housing (BH)    13                2                506
Wine (WINE)            13                3                178
BUPA                   6                 2                345
Balance Scale (BS)     4                 3                625
Glass (GLS)            9                 7                214
Shuttle (SHUT)         9                 7                58000
Letter (LET)           10                26               20000
Survival (SUR)         3                 2                306
Table 2. Results of HHCART and other DT methods 

Dataset   DT          Avg. Acc.        Avg. Size
BS        HHCART(A)   93.7 ± 1.3       7.9 ± 1.7
          HHCART(D)   88.3 ± 1.7       12.2 ± 3.5
          OC1         91.9 ± 0.9       8.7 ± 3.4
          OC1-AP      78.2 ± 1.3       37.5 ± 16.8
          OC1-LC      84.3 ± 1.5       12.6 ± 6.5
BH        HHCART(A)   83.3 ± 0.9       6.5 ± 2.1
          HHCART(D)   83.0 ± 0.7       9.9 ± 2.6
          OC1         82.2 ± 1.2       9.3 ± 3.4
          OC1-AP      82.0 ± 0.7       13.0 ± 5.3
          OC1-LC      81.5 ± 1.3       10.6 ± 6.0
BC        HHCART(A)   97.0 ± 0.3       2.4 ± 0.6
          HHCART(D)   97.0 ± 0.3       2.6 ± 1.1
          OC1         95.4 ± 0.5       3.3 ± 1.4
          OC1-AP      94.0 ± 0.8       8.3 ± 3.3
          OC1-LC      95.5 ± 0.6       3.4 ± 1.6
BUPA      HHCART(A)   64.1 ± 2.6       6.5 ± 1.5
          HHCART(D)   62.4 ± 2.5       8.6 ± 3.1
          OC1         66.9 ± 2.2       8.9 ± 6.1
          OC1-AP      64.7 ± 2.5       13.2 ± 10.5
          OC1-LC      64.4 ± 2.4       8.9 ± 3.6
GLS       HHCART(A)   60.3 ± 3.0       8.5 ± 3.0
          HHCART(D)   61.9 ± 3.0       10.1 ± 2.3
          OC1         61.1 ± 3.5       10.8 ± 4.3
          OC1-AP      64.6 ± 3.9       14.6 ± 8.7
          OC1-LC      67.4 ± 2.0       12.0 ± 3.6
HRT       HHCART(A)   74.1 ± 2.9       4.5 ± 1.7
          HHCART(D)   75.8 ± 2.8       7.8 ± 2.6
          OC1         77.1 ± 2.5       3.6 ± 1.0
          OC1-AP      76.3 ± 2.3       6.7 ± 2.4
          OC1-LC      76.3 ± 2.5       4.0 ± 1.1
PIND      HHCART(A)   72.2 ± 2.0       9.1 ± 5.1
          HHCART(D)   72.9 ± 1.3       10.8 ± 4.4
          OC1         73.4 ± 1.0       9.2 ± 5.4
          OC1-AP      73.6 ± 1.4       15.9 ± 8.7
          OC1-LC      72.8 ± 1.8       11.4 ± 9.6
SHUT      HHCART(A)   99.94 ± 0.02     25.4 ± 5.9
          HHCART(D)   99.94 ± 0.05     26.1 ± 4.9
          OC1         99.95 ± 0.03     32.6 ± 7.71
          OC1-AP      99.97 ± 0.02     26.5 ± 5.6
          OC1-LC      88.4 ± 7.07      44.7 ± 42.4
WINE      HHCART(A)   91.3 ± 1.6       3.4 ± 0.3
          HHCART(D)   88.7 ± 3.1       4.5 ± 0.6
          OC1         89.2 ± 2.1       3.5 ± 0.3
          OC1-AP      89.2 ± 4.6       4.6 ± 0.6
          OC1-LC      89.4 ± 2.7       3.8 ± 0.6
LET       HHCART(A)   82.1 ± 0.3       759.2 ± 88.1
          HHCART(D)   83.1 ± 0.3       1135.9 ± 122
          OC1         83.6 ± 0.4       1197.2 ± 88.9
          OC1-AP      86.3 ± 0.3       1611.7 ± 60.0
          OC1-LC      84.5 ± 0.2       1332.6 ± 146.3
SUR       HHCART(A)   73.5 ± 1.5       5.3 ± 2.7
          HHCART(D)   72.8 ± 1.0       5.0 ± 2.4
          OC1         71.0 ± 2.1       6.4 ± 3.5
          OC1-AP      71.9 ± 1.5       10.7 ± 6.5
          OC1-LC      70.2 ± 2.4       8.1 ± 4.4

Table 2 shows the results for our first experiment. The average accuracies and the average tree sizes of ten, five-fold 
cross validations are listed in the table. It is clear that oblique splits reduce the average tree size for all the data 
sets while increasing the accuracy for most of the data sets. The average accuracy of HHCART(A) is significantly (more 
than 2 standard deviations) higher than all the other methods tested for the BC dataset. Also the average accuracy of 
HHCART(A) is higher than the other methods for the BS, BH, WINE and SUR datasets. 

The average tree sizes of HHCART(A) are consistently smaller than the average tree sizes of the other methods except 
for the HRT dataset. Therefore the performance of HHCART(A) with respect to accuracy and tree size is better than 
the other methods for the BS, BH, BC, WINE and SUR datasets. 

Eight of the eleven datasets have at least 8 features. For six of these relatively high dimensional data sets, the performance 
of HHCART(A) is comparable with OC1 and OC1-LC. Therefore we can conclude that the proposed method 
works well in relatively high dimensional feature spaces. 

For all the datasets except BS and WINE, HHCART(D) performs as well as HHCART(A) in terms of the average 
accuracy. Also the tree sizes of HHCART(D) are comparable with those produced by HHCART(A) except for the BH, 
BUPA and HRT datasets. The performance of HHCART(D) is similar to that of OC1 with respect to both the accuracy and 
tree size for all the datasets except the BS, HRT and BUPA datasets. The time complexity of HHCART(A) is higher 
than that of HHCART(D) by a factor of $O(p)$. Results show that HHCART(D) produces DTs with similar accuracies 
and sizes to HHCART(A) and OC1 for most of the datasets. Hence HHCART(D) would be a more efficient method 
to use for higher dimensional problems. 

4.2. Comparison on Datasets having qualitative and quantitative features 

Experiments were performed to study the performance of the HHCART methods when the training examples 
contain both qualitative and quantitative features. Since OC1, OC1-AP, and OC1-LC are not designed to handle 
oblique splits containing both qualitative and quantitative features, QUEST was used for comparison purposes. 

4.2.1. Experimental setup 

Experiments were performed on the datasets available in [2], which are given in Table 3. Ten, five-fold cross 
validations were used and the average accuracies and tree sizes (over ten cross validations) are reported in Table 4. 
The Income dataset comes with its own training and testing sets of 30162 and 15060 examples respectively. We induced 
10 trees, each using 90% of the training examples for induction, with the remaining 10% used for pruning. The accuracy of 
all the trees was estimated using the same test set. QUEST uses the following parameter settings: estimated priors, 
unit misclassification cost, zero standard error for pruning, linear splits, linear discriminant analysis for the split point, and 
minimum node size for splitting = 2. The HHCART methods were implemented as above. For the Income dataset, 
HHCART(A)'s performance is significantly (more than 2 standard deviations) better than QUEST both in terms of the 
average accuracy and average tree size. For the other two datasets, HHCART(A) produces comparable accuracies with 
smaller trees. These results also suggest that the HHCART algorithms perform well in relatively high dimensions. 
Though HHCART(D) produces larger trees compared with HHCART(A), its classification accuracy is comparable 
with HHCART(A). 

Table 3. Real data sets with qualitative and quantitative features, downloaded from the UCI Repository 

Data set   No. of features (No. of qualitative)   No. of classes   No. of examples
Income     14 (8)                                 2                45222
Bank       16 (9)                                 2                45211
StatLog    14 (8)                                 2                690
Table 4. Results of HHCART and QUEST 

Dataset   Decision Tree   Avg. Acc.        Avg. Size
Income    HHCART(A)       85.1 ± 0.2       32.7 ± 12.9
          HHCART(D)       85.5 ± 0.2       59.5 ± 19.7
          QUEST           83.9 ± 0.2       68.0 ± 23.1
Bank      HHCART(A)       90.2 ± 0.12      22.58 ± 11.94
          HHCART(D)       90.4 ± 0.07      44.4 ± 14.19
          QUEST           90.1 ± 0.1       27.0 ± 15.2
StatLog   HHCART(A)       85.1 ± 0.9       5.6 ± 1.9
          HHCART(D)       85.8 ± 0.7       6.5 ± 3.0
          QUEST           85.65 ± 0.92     6.08 ± 3.6


5. Conclusions 

In this work we have presented a new way of inducing oblique DTs called HHCART. It uses the eigenvectors of the 
estimated covariance matrices of respective classes to define a Householder matrix which is used to reflect the examples 
so that reflected axis parallel splits can be found. Two versions of HHCART have been presented: HHCART(A) 
uses all possible eigenvectors of the estimated covariance matrices of respective classes whereas HHCART(D) uses 
only the dominant eigenvector of each class. Based on the empirical results obtained, it is clear that both HHCART 
methods perform well in terms of accuracy and tree size. Furthermore, HHCART is capable of classifying datasets 
with both qualitative and quantitative features. 


Appendix A. Time Complexity of HHCART 

Here we derive the maximal time complexity at a node of HHCART(A) and HHCART(D). Assume there are n 
examples with p quantitative features and C classes at the node. 

1. HHCART(A) and HHCART(D) - The complexity of constructing the estimated covariance matrix for one class of 
examples is $O(np^2)$. For $C$ classes the complexity is $O(Cnp^2)$. 

2. HHCART(A) - The complexity of the complete eigen analysis for one class of examples is $O(p^3)$. For $C$ classes 
the complexity is $O(Cp^3)$. 

HHCART(D) - The complexity of finding the dominant eigenvector for one class of examples is $O(p^2)$. For $C$ 
classes the complexity is $O(Cp^2)$. 

3. HHCART(A) - The complexity of the reflection of $n$ examples using one Householder matrix is $O(np^2)$. Since 
there are $Cp$ Householder matrices the complexity is $O(Cnp^3)$. 

HHCART(D) - The complexity of the reflection of $n$ examples using one Householder matrix is $O(np^2)$. For $C$ 
Householder matrices the complexity is $O(Cnp^2)$. 

4. HHCART(A) - The complexity of finding the best axis parallel split for one reflected space is $O(n^2p)$. Since there 
are $Cp$ reflected spaces the complexity is $O(Cn^2p^2)$. 

HHCART(D) - The complexity of finding the best axis parallel split for one reflected space is $O(n^2p)$. For $C$ 
classes the complexity is $O(Cn^2p)$. 

5. HHCART(A) - The maximal time complexity at a node is $O(Cnp^2) + O(Cp^3) + O(Cnp^3) + O(Cn^2p^2) = O(Cn^2p^3)$. 

HHCART(D) - The maximal time complexity at a node is $O(Cnp^2) + O(Cp^2) + O(Cnp^2) + O(Cn^2p) = O(Cn^2p^2)$. 
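The derivation does not say how the dominant eigenvector is obtained for HHCART(D). One common choice, shown here purely as an assumption, is power iteration, in which each iteration is a single matrix-vector product costing on the order of $p^2$ operations. The function name `dominant_eigenvector`, the iteration limit and the tolerance are hypothetical.

```python
import numpy as np

def dominant_eigenvector(S, iters=500, tol=1e-12):
    """Approximate the dominant eigenvector of a covariance matrix S by power iteration."""
    S = np.asarray(S, dtype=float)
    v = np.ones(S.shape[0]) / np.sqrt(S.shape[0])   # starting vector (must not be orthogonal to the answer)
    for _ in range(iters):
        w = S @ v                                   # one matrix-vector product: O(p^2)
        norm = np.linalg.norm(w)
        if norm == 0.0:                             # S maps v to zero; nothing to normalise
            break
        w = w / norm
        converged = np.linalg.norm(w - v) < tol
        v = w
        if converged:
            break
    return v

# Hypothetical 2 x 2 covariance matrix, checked against a full eigen analysis.
S = np.array([[4.0, 1.0], [1.0, 3.0]])
v = dominant_eigenvector(S)
ref = np.linalg.eigh(S)[1][:, -1]
```

The number of iterations needed depends on the gap between the two largest eigenvalues, so this sketch is cheaper than a complete eigen analysis only when that gap is not too small.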